{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Catalyst example on table-data\n",
    "@DBusAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import OrderedDict\n",
    "import numpy as np\n",
    "from matplotlib.pylab import plt\n",
    "%matplotlib inline\n",
    "from sklearn.datasets.california_housing import fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "import torch\n",
    "from torch import nn\n",
    "import torch.nn.functional as F\n",
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "\n",
    "from catalyst.dl import SupervisedRunner\n",
    "from catalyst.dl.callbacks import SchedulerCallback\n",
    "from catalyst.contrib.optimizers import Lookahead\n",
    "from catalyst.utils import set_global_seed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reproduce all\n",
    "Catalyst provides a special utils for research results reproducibility."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SEED=42\n",
    "set_global_seed(SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get some data\n",
    "In this tutorial we will use \n",
    "[California dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html )<br>\n",
    "Also, we split all data: <b>75/25</b> - for training /validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = fetch_california_housing(return_X_y=True)\n",
    "x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset definition\n",
    "\n",
    "We have to normalize all X-data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mscl = StandardScaler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train = mscl.fit_transform(x_train)\n",
    "x_test = mscl.transform(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And prepare PyTorch Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_ds = TensorDataset(torch.FloatTensor(x_train), torch.FloatTensor(y_train.reshape(-1,1)))\n",
    "test_ds = TensorDataset(torch.FloatTensor(x_test), torch.FloatTensor(y_test.reshape(-1,1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataLoader definition\n",
    "\n",
    "We have to define bacth size and shuffle train data: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch = 120\n",
    "\n",
    "train_dl = DataLoader(train_ds, batch_size=batch, shuffle=True, num_workers=2)\n",
    "test_dl = DataLoader(test_ds, batch_size=batch, shuffle=False, num_workers=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Catalyst loader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = OrderedDict()\n",
    "data['train'] = train_dl\n",
    "data['valid'] = test_dl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define model\n",
    "\n",
    "Our Neural Network structure will be very simple. Just MLP with 40,20,1 linear layers. Also, default initialization. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Net(nn.Module):\n",
    "    def __init__(self, num_features):\n",
    "        super(Net,self).__init__()\n",
    "        layers = [40, 20]\n",
    "        self.L1 = nn.Linear(num_features, layers[0])\n",
    "        torch.nn.init.xavier_uniform_(self.L1.weight) \n",
    "        torch.nn.init.zeros_(self.L1.bias)\n",
    "        \n",
    "        self.L2 = nn.Linear(layers[0], layers[1])\n",
    "        torch.nn.init.xavier_uniform_(self.L2.weight) \n",
    "        torch.nn.init.zeros_(self.L2.bias)\n",
    "        \n",
    "        self.L3 = nn.Linear(layers[1], 1)\n",
    "        torch.nn.init.xavier_uniform_(self.L3.weight) \n",
    "        torch.nn.init.zeros_(self.L3.bias)\n",
    "    def forward(self, x):\n",
    "        x = F.relu(self.L1(x))\n",
    "        x = F.relu(self.L2(x))\n",
    "        x = F.relu(self.L3(x))\n",
    "        return x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Net(x_train.shape[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Default optimizer and <b>L2 loss</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)\n",
    "crit = nn.MSELoss()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For model training we need SupervisedRunner and train method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "runner = SupervisedRunner()\n",
    "runner.train(\n",
    "    model=model,\n",
    "    criterion=crit,\n",
    "    optimizer=optimizer,\n",
    "    loaders=data,\n",
    "    logdir=\"run\",\n",
    "    num_epochs=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inference\n",
    "\n",
    "Inference part is mush easier: <br>\n",
    "<b>/checkpoints/best.pth</b> - is default dir for checkpoints<br>\n",
    "<b>run</b> - our logdir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = runner.predict_loader(\n",
    "    model, data[\"valid\"], resume=f\"run/checkpoints/best.pth\", verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "\n",
    "Let's calculate MSE error "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_squared_error(y_test, predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Prediction Viz\n",
    "\n",
    "And finally - show scatterplot for our predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(y_test, predictions.flatten())"
   ]
  }
 ],
 "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.0"
    }
  },
 "nbformat": 4,
 "nbformat_minor": 2
}
